External and Intrinsic Plagiarism Detection Using a Cross-Lingual Retrieval and Segmentation System - Lab Report for PAN at CLEF 2010
Abstract
We present our hybrid system for the PAN challenge at CLEF 2010. The system detects both externally plagiarized passages, translated and non-translated, and intrinsically plagiarized passages. Our external plagiarism detection approach is formulated as an information retrieval problem, with heuristic post-processing to arrive at the final detection results. For the retrieval step, source documents are split into overlapping blocks, which are indexed via a Lucene instance. Suspicious documents are similarly split into consecutive overlapping boolean queries, which are run against the Lucene index to retrieve an initial set of potentially plagiarized passages. For performance reasons, a heuristic may reject a query before it is executed. Candidate hits gathered in the retrieval step are further post-processed by performing sequence analysis on the passages retrieved from the index with respect to the passages used for querying the index. Several merge heuristics then combine matching sequences into bigger blocks. German and Spanish source documents are first translated using word alignment on the Europarl corpus before entering the above detection process. For each word in a translated document, several translations are produced. Intrinsic plagiarism detection is done by finding major changes in style, measured via word suffixes, after the documents have been partitioned by a linear text segmentation algorithm. Our approach led us to the third overall rank, with an overall score of 0.6948.
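The block splitting and query construction described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `overlapping_blocks` and `boolean_query`, the block size, and the step are hypothetical, and the abstract does not state the actual values or the exact query form. The sketch assumes tokens are plain words that need no escaping and joins them with `OR`, in the style of Lucene's classic query syntax.

```python
from typing import List


def overlapping_blocks(tokens: List[str], size: int, step: int) -> List[List[str]]:
    """Split tokens into consecutive blocks of `size` tokens, advancing by `step`.

    With step < size, neighbouring blocks share size - step tokens.
    The values of size and step are assumptions, not taken from the paper.
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    if not tokens:
        return []
    blocks = []
    start = 0
    while True:
        blocks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last block reached the end of the document
        start += step
    return blocks


def boolean_query(block: List[str]) -> str:
    """Build a disjunctive query string from the distinct terms of one block.

    Assumes terms are plain words; real input would need escaping.
    """
    return " OR ".join(sorted(set(block)))


if __name__ == "__main__":
    words = "a b c d e f g".split()
    for block in overlapping_blocks(words, size=4, step=2):
        print(boolean_query(block))
```

Source documents and suspicious documents would both pass through a splitter of this kind: source blocks go to the index, and each suspicious block becomes one query.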
Similar resources
Plagiarism Detection Using Information Retrieval and Similarity Measures Based on Image Processing Techniques - Lab Report for PAN at CLEF 2010
This paper describes the Barcelona Media Innovation Center participation in the 2nd International Competition on Plagiarism Detection. In particular, our system focused on the external plagiarism detection task, which assumes the source documents are available. We present a two-step approach. In the first step of our method, we build an information retrieval system based on Solr/Lucene, segmen...
CoReMo System (Contextual Reference Monotony) - Lab Report for PAN at CLEF 2010
This paper presents a new approach to very fast monolingual external plagiarism detection, based on an altered n-gram concept (contextual n-gram), a new high-precision contextual Information Retrieval engine, and a new pruning strategy (Referential Monotony) for plagiarism detection and its limits. The assessment results can be compared with those obtained by the winning team at PAN...
Fuzzy Semantic-Based String Similarity for Extrinsic Plagiarism Detection - Lab Report for PAN at CLEF 2010
This report explains our plagiarism detection method, which uses a fuzzy semantic-based string similarity approach. The algorithm was developed in four main stages. The first is pre-processing, which includes tokenisation, stemming and stop-word removal. The second is retrieving a list of candidate documents for each suspicious document using shingling and the Jaccard coefficient. Suspicious documents are th...
Improving the Reliability of the Plagiarism Detection System - Lab Report for PAN at CLEF 2010
In this paper we describe our approach at the PAN 2010 plagiarism detection competition. We refer to the system we have used in PAN’09. We then present the improvements we have tried since the PAN’09 competition, and their impact on the results on the development corpus. We describe our experiments with intrinsic plagiarism detection and evaluate them. We then discuss the computational cost of ...
PAN 2010 : Detecting External Plagiarism Lab Report for Pan at CLEF 2010
This paper presents our approach to detecting plagiarism in the PAN’10 competition. To accomplish this task we applied a method that aims at detecting external plagiarism cases. The method is specially designed to detect cross-language plagiarism and is composed of five phases: language normalization, retrieval of candidate documents, classifier training, plagiarism analysis, and post-processing. ...